Part 1:

1. Import and warehouse data: [ Score: 3 points ]

  1. Import all the given datasets and explore their shape and size.
  2. Merge all the datasets into one and explore the final shape and size.
  3. Export the final dataset and store it on the local machine in .csv, .xlsx and .json formats for future use.
  4. Import the data from the above step back into Python.
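
The warehousing steps above can be sketched with pandas. The frames and file names here are stand-ins for the actual datasets provided with the assignment:

```python
import pandas as pd

# Stand-in frames -- substitute the actual datasets provided.
df_a = pd.DataFrame({"id": [1, 2], "mpg": [18.0, 15.0]})
df_b = pd.DataFrame({"id": [1, 2], "weight": [3504, 3693]})

print(df_a.shape, df_b.shape)          # explore shape and size of each dataset

# Merge everything onto one dataset using the shared key.
merged = df_a.merge(df_b, on="id", how="inner")
print(merged.shape)                    # final shape

# Export in all three formats for future use.
merged.to_csv("merged.csv", index=False)
merged.to_json("merged.json", orient="records")
try:
    merged.to_excel("merged.xlsx", index=False)   # needs openpyxl/xlsxwriter
except ModuleNotFoundError:
    pass

# Re-import the stored copy.
reloaded = pd.read_csv("merged.csv")
```

`how="inner"` keeps only rows whose key appears in both frames; swap in `"outer"` if every row should survive the merge.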

2. Data cleansing: [ Score: 3 points ]

3. Data analysis & visualisation: [ Score: 4 points ]

Observations:

  1. Very high positive correlation between cylinders and displacement: 0.95.
  2. High positive correlation between cylinders and horsepower: 0.84.
  3. Very high positive correlation between cylinders and weight: 0.9.
  4. Strong negative correlation between weight and miles per gallon: -0.83.
  5. Moderate negative correlation between horsepower and acceleration: -0.69.
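
A correlation matrix like the one behind these observations can be computed with `pandas.DataFrame.corr()`. The toy frame below only mimics auto-mpg-style columns; the values are illustrative, not the assignment's data:

```python
import pandas as pd

# Toy frame mirroring the auto-mpg style columns (illustrative values only).
df = pd.DataFrame({
    "cylinders":    [8, 8, 4, 4, 6],
    "displacement": [307.0, 350.0, 97.0, 121.0, 199.0],
    "horsepower":   [130, 165, 88, 110, 97],
    "weight":       [3504, 3693, 2130, 2600, 2774],
    "mpg":          [18.0, 15.0, 27.0, 24.0, 21.0],
})

corr = df.corr()                # pairwise Pearson correlations
print(corr.round(2))

# e.g. weight vs mpg comes out strongly negative in data like this
print(corr.loc["weight", "mpg"])
```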

Observations:

4. Machine learning: [ Score: 8 points ]

Observations:

5. Answer below questions based on outcomes of using ML based methods. [ Score: 5 points ]

Observation:

Observations:

Improvisation: [ Score: 2 points ]

Observations:

Part 2:

Part 3:

Observations:

We'll now proceed with the SVM modelling:

  1. SVM on the original dataset.
  2. SVM on the PCA-generated dataset.
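
A minimal sketch of the two runs, using scikit-learn's digits data as a stand-in for the notebook's dataset:

```python
from sklearn.datasets import load_digits          # stand-in dataset
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# 1. SVM on the original features.
svm_raw = make_pipeline(StandardScaler(), SVC())
svm_raw.fit(X_tr, y_tr)
acc_raw = svm_raw.score(X_te, y_te)

# 2. SVM on the PCA-reduced features (20 components is an arbitrary choice here).
svm_pca = make_pipeline(StandardScaler(), PCA(n_components=20), SVC())
svm_pca.fit(X_tr, y_tr)
acc_pca = svm_pca.score(X_te, y_te)

print(acc_raw, acc_pca)
```

Comparing the two accuracies shows how much predictive signal survives the dimensionality reduction.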

Observations:

END of PART 3.

Part 4:

- We can see a positive relationship between some of the columns in the dataset.
- The dataset was cleaned and prepared for analysis.
- Before modelling, we have to standardize the dataset.
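
Standardization can be done with scikit-learn's `StandardScaler`; a minimal sketch on toy data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix with columns on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)   # each column now has mean 0 and std 1

print(X_std.mean(axis=0))
print(X_std.std(axis=0))
```

Fitting the scaler on the training split and reusing it on the test split avoids leaking test statistics into the model.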

Observations:

Observations:

Part 5:

Questions: [ Total Score: 5 points ]

  1. List all possible dimensionality reduction techniques that can be implemented using Python.
  2. So far you have used dimensionality reduction on numeric data. Is it possible to do the same on multimedia data [images and video] and text data? Please illustrate your findings using a simple implementation in Python.

Dimensionality reduction techniques can be classified into three categories:

Feature selection:
  1. Random Forest:

    • This is one of the most commonly used techniques: a trained random forest reports the importance of each feature in the dataset. We can rank the features by importance and keep only the top ones, resulting in dimensionality reduction.
  2. Missing Value Ratio:

    • If the dataset has too many missing values, we use this approach to reduce the number of variables. We can drop the variables having a large number of missing values in them.
  3. High Correlation filter:

    • A pair of variables having high correlation increases multicollinearity in the dataset. So, we can use this technique to find highly correlated pairs and drop one variable from each pair.
  4. Low Variance filter:

    • We apply this approach to identify and drop constant variables from the dataset. The target variable is not unduly affected by variables with low variance, and hence these variables can be safely dropped.
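
Two of the feature-selection filters above can be sketched with scikit-learn; the iris data is a stand-in for the assignment's dataset:

```python
import numpy as np
from sklearn.datasets import load_iris            # stand-in dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold

X, y = load_iris(return_X_y=True)

# Random Forest: rank features by importance and keep the top ones.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top2 = np.argsort(rf.feature_importances_)[::-1][:2]
X_top = X[:, top2]

# Low Variance filter: drop constant columns (threshold=0 removes zero-variance features).
X_const = np.hstack([X, np.ones((X.shape[0], 1))])   # append an artificial constant column
X_filtered = VarianceThreshold(threshold=0.0).fit_transform(X_const)

print(X_top.shape, X_filtered.shape)
```

The missing-value-ratio and high-correlation filters follow the same pattern: compute `df.isna().mean()` or `df.corr()` per column, then drop columns past a chosen threshold.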
Components / Factor Based:
  1. Principal Component Analysis:

    • This is one of the most widely used techniques for dealing with linear data. It divides the data into a set of components which try to explain as much variance as possible.
  2. Factor Analysis:

    • This technique is best suited for situations where we have highly correlated set of variables. It divides the variables based on their correlation into different groups, and represents each group with a factor.
  3. Independent Component Analysis:

    • We can use ICA to transform the data into statistically independent components that describe the data using fewer components.
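
All three component-based techniques are available in scikit-learn; a sketch on the iris data as a stand-in:

```python
from sklearn.datasets import load_iris            # stand-in dataset
from sklearn.decomposition import PCA, FactorAnalysis, FastICA

X, _ = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)                       # maximise explained variance
X_fa  = FactorAnalysis(n_components=2).fit_transform(X)            # group correlated variables
X_ica = FastICA(n_components=2, random_state=0).fit_transform(X)   # independent components

print(X_pca.shape, X_fa.shape, X_ica.shape)
```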
Projection Based:
  1. ISOMAP:
    • Useful when the data is strongly non-linear.
  2. t-SNE:
    • Works well when the data is strongly non-linear, and is also good for visualization.
  3. UMAP:
    • Well suited to high-dimensional data.
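
Isomap and t-SNE ship with scikit-learn (UMAP lives in the separate umap-learn package); a sketch on the iris data as a stand-in:

```python
from sklearn.datasets import load_iris            # stand-in dataset
from sklearn.manifold import Isomap, TSNE

X, _ = load_iris(return_X_y=True)

X_iso  = Isomap(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)
# UMAP equivalent (requires `pip install umap-learn`):
#   import umap; X_umap = umap.UMAP(n_components=2).fit_transform(X)

print(X_iso.shape, X_tsne.shape)
```

Unlike PCA, these projections do not provide an inverse transform, so they are typically used for visualization rather than as a preprocessing step for modelling.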

We'll now see the application of a dimensionality reduction technique on image data (the digits dataset):
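
A minimal sketch using PCA on scikit-learn's digits data, where each sample is an 8x8 pixel image flattened to 64 features:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Each digit image is an 8x8 grid flattened to 64 pixel features.
X, y = load_digits(return_X_y=True)
print(X.shape)                         # (1797, 64)

pca = PCA(n_components=0.95)           # keep 95% of the variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # far fewer than 64 components
```

The same recipe applies to text once it is vectorized (e.g. with TF-IDF): any representation that turns the data into a numeric matrix can be fed to these reducers.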

Observations: